Skip to content

fix(workflow): enforce per-task timeout in workflow execution#479

Open
brycehemme wants to merge 1 commit into
strands-agents:mainfrom
brycehemme:fix/workflow-enforce-task-timeout
Open

fix(workflow): enforce per-task timeout in workflow execution#479
brycehemme wants to merge 1 commit into
strands-agents:mainfrom
brycehemme:fix/workflow-enforce-task-timeout

Conversation

@brycehemme
Copy link
Copy Markdown

Description

Issue #326 reported that the workflow tool accepts a per-task timeout field (and applies a 300s default in create) but never enforces it. A hung task blocks the entire workflow loop indefinitely, since wait(active_futures.values(), return_when=FIRST_COMPLETED) is called without a timeout and future.result() is likewise unbounded.

This PR enforces the per-task timeout in two places:

  • Bound the wait. Compute the soonest active-task deadline (start_time + timeout minimum across all active futures) and pass it as timeout= to wait(). The loop now wakes up no later than the next task limit, even if no future completes.
  • Mark timed-out futures as error. After wait() returns, iterate every active future. Done futures process as before. Still-running futures get checked against their per-task timeout — if exceeded, the task is marked status: "error" with "Task execution timeout after Ns", added to completed_tasks, and removed from active_futures. future.cancel() is called as a best effort.

Two small helpers are added to WorkflowManager: _get_task_timeout(workflow, task_id) and _next_task_deadline(active_futures, workflow).

Caveats (worth surfacing for review)

  1. Worker thread can't be terminated. Python doesn't expose a portable way to kill a running thread, so future.cancel() on a started future returns False and the worker function keeps running until it returns. This PR only unblocks the workflow loop and dependent tasks; a runaway tool function still consumes CPU until it finishes. An inline comment in start_workflow explains the constraint. The reporter's example (a time.sleep(120) tool with timeout: 30) is handled correctly under this constraint: the workflow finishes around 30s, the underlying sleep keeps running another ~90s in the background.

  2. Related to issue [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325 (deadlock on failed tasks) but doesn't fix it. The reporter noted that without timeout enforcement, hung tasks never fail and so never trigger the dependent-task-deadlock condition described in [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325. This PR makes hung tasks fail, which exposes [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325 — dependent tasks may now be blocked instead of the workflow itself. [BUG] Workflow deadlocks indefinitely when task fails and blocks dependent tasks #325 should be tracked separately.

Related Issues

Fixes #326

Type of Change

Bug fix

Testing

Ran hatch run prepare end-to-end via python:3.12 in Docker:

docker run --rm -v "$PWD:/work" -w /work python:3.12 bash -c \
  "git config --global --add safe.directory /work && pip install --quiet hatch && hatch run prepare"
  • format: clean (no changes to PR files)
  • lint: clean
  • test-lint (ruff format --check): clean
  • test: 1211 passed, 33 skipped, 1 failure

The single failure is tests/test_file_write.py::test_file_write_error_handling. I verified it reproduces on main without this PR (same Python 3.12 / hatch environment), so it's pre-existing and unrelated.

Added test_start_workflow_enforces_task_timeout (in TestWorkflowExecution): installs a hung execute_task that sleeps 2.0s, configures timeout: 0.2, asserts the workflow completes in under 1.5s with the task marked status: "error" and the result text containing the timeout. Resets the WorkflowManager singleton around the test to avoid leaking the patched execute_task / shut-down TaskExecutor into other tests.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly (inline comment in start_workflow explaining the thread-termination caveat)
  • I have added an appropriate example to the documentation to outline the feature (N/A — bug fix to existing behavior)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

The workflow tool accepted a `timeout` field per task and applied a 300s
default, but the value was never enforced. `wait()` was called without a
timeout and `future.result()` was called without a timeout, so a hung
task blocked the workflow loop indefinitely.

Track per-task timeouts in two places:

* Compute the soonest active-task deadline and pass it as `timeout=` to
  `wait()`, so the loop wakes up no later than the next task limit.
* After `wait()` returns, check every still-running future against its
  per-task timeout. Tasks past their deadline are marked `error` with a
  `"Task execution timeout after Ns"` message and removed from
  `active_futures`, so the workflow can progress instead of blocking.

`future.cancel()` is attempted for timed-out tasks but cannot reliably
terminate an already-running worker thread (Python doesn't expose a
portable mechanism). The runaway task continues consuming CPU until its
function returns; this PR only unblocks the workflow loop and dependent
tasks. Documented inline.

Adds `test_start_workflow_enforces_task_timeout`: installs a hung
`execute_task` that sleeps 2.0s, configures `timeout: 0.2`, and asserts
the workflow returns in under 1.5s with the task marked as a timeout
error. Resets the `WorkflowManager` singleton around the test to avoid
leaking state.

Fixes strands-agents#326
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Workflow - Task timeout parameter is not enforced - tasks can run indefinitely

1 participant